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Abstract 

Background: Recessive genes cause disease when both copies are affected by mutant loci. Resolving the cis/trans 
relationship of variations has been an important problem both for researchers, and increasingly, clinicians. Of 
particular concern are patients who have two heterozygous disease-causing mutations and could be diagnosed 
as affected (one mutation on each allele) or as phenotypically normal (both mutations on the same allele). Several 
methods are currently used to phase genes, however due to cost, complexity and/or low sensitivity they are not 
suitable for clinical purposes. 

Methods: Long-range amplification was used to select and enrich the target gene {CYP21A2) followed by modified 
mate-pair sequencing. Fragments that mapped coincidently to two heterozygous sites were identified and used 
for statistical analysis. 

Results: Probabilities for cis/trons relationships between heterozygous positions were calculated along with 99% 
confidence intervals over the entire length of our 10 kb amplicons. The quality of phasing was closely related to 
the depth of coverage and the number of erroneous reads. Most of the error was found to have been introduced 
by recombination in the PGR reaction. 

Conclusions: We have developed a simple method utilizing massively parallel sequencing that is capable of 
resolving two alleles containing multiple heterozygous positions. This method stands out among other phasing 
tools because it provides quantitative results allowing confident haplotype calls. 
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Background 

The use of diagnostic gene sequencing has dramatically 
increased during the last two decades. However, accurate 
interpretation of sequencing data remains a challenge, 
despite technical advances. One common problem is un- 
certainty about the cis/trans status, or phase, of hetero- 
zygous variations. Properly phased genomic information is 
frequently required for accurate diagnosis of recessive 
genetic diseases. The scale of this problem is considerable, 
as indicated by a recent query of the Online Mendelian 
Inheritance in Man (OMIM) database which revealed 
over 250 recessive genes known to be associated with 
more than 1,100 disorders [1]. Unfortunately, Sanger 



* Correspondence: vasmatzis.george@mayo.edu; grebe.stefan@mayo.edu 
^Department of Molecular Medicine, Mayo Clinic, Rochester, MN 55905, USA 
^Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, 
MN 55905, USA 

Full list of author information is available at the end of the article 

(3 BioMed Central 



sequencing, the most widely used technique and current 
gold standard, is incapable of separating phases without 
allele-specific capture or allele-specific amplification. 

While this problem has long been recognized, a simple 
and effective solution has remained elusive. Computa- 
tional methods have been developed to estimate haplotype 
sequences based on the individuals genotype compared to 
a population [2], but they lack the resolution and accuracy 
needed for clinical use. 

A more definitive approach for genetic phasing is 
based on manipulation of single chromosomes, either 
through cell hybrid systems, using conversion technol- 
ogy [3,4], or by means of size-exclusion devices [5]. 
While this strategy is perhaps the most reliable for gen- 
erating accurate haplotype sequences, it is by far the 
most labor intensive approach. It is also error and failure 
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prone, due to its lengthy, complex and technically diffi- 
cult workflows. 

More recently, the phasing problem has been tackled 
using massively scaled Next Generation Sequencing 
(NGS). Briefly, these methods depend on the creation of 
at least 100 libraries from each patient using techniques 
such as bacterial fosmid construction or multiple dis- 
placement amplification [6,7]. Libraries are indexed, 
pooled, sequenced and then computationally combined 
into two haplotype consensus sequences. While these 
methods are powerful for generating phased sequences 
for entire genomes, they are cumbersome, slow and cur- 
rently expensive. 

Since each of these approaches is in some way unsuit- 
able for routine clinical use, current protocols for solving 
cis/ trans questions typically involve testing of family 
members. This is a costly and time consuming under- 
taking that may still fail, if there is insufficient genetic 
diversity in the tested familial cohort. As an alternative, 
allele-specific PGR can be employed. However, the cost 
and effort required to design and validate assays makes 
this prohibitive in genes where there are many possible 
combinations of mutant positions. 

Revisiting NGS techniques, with a view to creating a 
simpler solution than multiple indexed library sequen- 
cing, could provide an attractive solution to the phasing 
problem, in particular as NGS is now starting to replace 
Sanger sequencing in clinical applications. Because NGS 
methods are based on deriving sequences from a single 
molecule, one should be able to adapt the methodology 
for accurate phasing of genomic sequences. Most of the 
current platforms use a paired end (PE) protocol in 
which a string of sequence is read from either end of a 
larger DNA fragment. Since the reads come from oppos- 
ite ends of the same fragment and are linked through a 
continuous strand of DNA, we refer to them as linked 
reads. Given their linked nature, any variations detected 
in the same fragment are cis to one another. 

The current Illumina PE library sequencing protocol 
restricts library fragment size to 250-500 bp because 
longer fragments decrease the quality of data through 
overlapping and reduced density of clusters. Coverage of 
larger distances between nucleotide positions of interest 
can, however, be achieved through the mate paired (MP) 
library protocol. This protocol initially utilizes larger 
genomic fragments of 2-5 kb that are self-ligated prior 
to a secondary fragmentation to the conventional PE 
library size centering on 500 bp (Figure 1). Biotinylation 
of the termini of larger fragments prior to circularization 
enables the isolation of DNA containing the ligated ends. 
Sequencing of these fragments containing junction points 
thus generates paired reads that are linked across much 
greater distances than in conventional PE libraries, at the 
expense of some loss in coverage for short inter-variant 




Figure 1 Mate pair library preparation. The MP protocol allows 
sequence information to be linked across greater distances than PE 
reads. Fragments of sheared DNA from a pool (500 - 5000 bp with 
an average of 2000 bp) are end-repaired using biotinylated nucleo- 
tides. Fragments are then self-ligated and all remaining linear DNA is 
removed by exonuclease treatment. Circularized DNA is fragmented 
again (black bars) to an average size of 500 bp and segments containing 
biotinylated junction points are isolated on streptavidin beads. In 
addition to fragments containing junction points, a portion of 
non-biotinylated DNA is co-purified and appears in the MP library 
as a subpopulation of PE reads. All fragments are end-repaired and 
indexed using TruSeq adapters followed by sequencing. 

V J 

distances. A combination of PE (100-600 bp) and MP 
(500-5,000 bp) libraries over a defined gene region could 
therefore complement each other in terms of phased 
coverage and should allow accurate determination of cis/ 
trans status of multiple sequence variants over a relatively 
large range of distances. 

We tested this supposition using the CYP21A2 gene as 
a model system. This gene is commonly sequenced 
during diagnosis of congenital adrenal hyperplasia (CAH). 
The combination of the modest length of this gene 
(-3400 bp), a rate of at least 10% compound heterozygos- 
ity for mutations or variants of unknown significance in 
patients, and availability of genetic family studies in most 
cases, make CYP21A2 a suitable model system as a proof 
of principle test of our approach. 

Methods 

Long-range PGR 

CYP21A2 is located in the HLA region on chromosome 
6p2.13. An inactive yet highly homologous pseudogene 
{CYP21A1P) is located 30 kb upstream and has been 
known to confuse genotyping assays for CYP21A2 [8,9]. 
To enrich our mate pair library with the active gene and 
eliminate the pseudogene we performed long-range PGR 
(IrPCR) using unique priming locations around CYP21A2. 
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Priming sequences were 5'-AGTGGGGCTCTGAAGAC 
TGA-3' for the forward position and 5'-CCCTCGGGA- 
GATGATCTGTA-3' for the reverse to amplif)^ a clean 
10 kb product (Figure 2). LA Taq and associated buffers 
from TaKaRa were used in the reaction at their re- 
commended concentrations. Approximately 150 ng of 
template DNA was used in the PGR reaction. Cycle condi- 
tions were as follows: 95°C for 5 m; 10 cycles of (95°C for 
30 s, 60°C for 30 s, 72°C for 10 m); 20 cycles of (95°C for 
30 s, 55 for 30 s, 72°C for 10 m); 72°C for 20 m. 

Whole genome amplified background DNA 

The MP protocol is driven towards intra-molecular 
circularization over inter-molecular ligation of two sep- 
arate DNA fragments simply by spatial dilution. In order 
to minimize the complication of inter-fragment ligations 
from a limited sequence amplicon input we investigated 
spiking of the 10 kb amplicon at four different concen- 
trations into a background of whole genome amplified 
(WGA) DNA from a normal individual. WGA DNA was 
used due to its similar average fragment size to the 10 kb 
IrPCR amplicon, than conventional extracted genomic 




Figure 2 Clean 10 Kb amplicons. A single, clean band at 10 Kb 
shows the specificity of our long-range amplification. 

v J 



DNA preparations (-50 kb), making downstream fragmen- 
tation in the library prep protocol more predictable. This 
approach additionally enabled us to evaluate the role of 
amplicon concentration on inter-fragment ligation. Back- 
ground WGA DNA was generated from genomic DNA 
using a Qiagen Repli-g midi kit according to recommended 
protocols. Background and amplified DNA concentrations 
were measured by fluorescence on a Qubit fluorometer 
(Invitrogen), and 10, 100, 500, and 1,000 ng aliquots of 
IrPCR product were spiked into WGA DNA to a total of 
5 ug for each library preparation. 

Library preparation and sequencing 

MP libraries were prepared for each spiked pool of 
IrPCR product and WGA background DNA based on 
previously reported protocols [10]. Each pool was frag- 
mented on an E210 Focused-ultrasonicator (Covaris) to 
fragments ranging from 500 to 5000 bp with an average 
of 2000 bp. FoUowing purification on Qiaex II beads, 
DNA fragment ends were repaired and biotinylated 
using a mixture of natural and biotinylated dNTPs. 
Excess reagents and by-products were removed using 
Qiaex II beads. Six-hundred ng of DNA from each pool 
were circularized in 16 hour ligation reactions at 30°C 
prior to exonuclease treatment at 37°C for 20 minutes to 
digest any remaining linear strands of DNA. The circu- 
larized DNA was then fragmented to 300-500 bp using 
the M220 Focused-ultrasonicator. Streptavidin beads 
were applied to isolate ligation junction fragments. End 
repair, blunt ending and adapter ligation were performed 
whfle fragments were bound to the beads. PGR was 
performed to produce bead-free fragments which were 
subsequently assembled into indexed MP libraries using 
TruSeq adapters (Illumina). While streptavidin beads 
provide good recovery of biotinylated DNA, they also 
co-purify a fraction of unlabeled fragments from other loca- 
tions in the sheared, circularized DNA. We used this to our 
advantage by allowing these fragments into our libraries to 
provide PE reads covering positions 100 to 500 bp apart. 

The four final indexed MP libraries were purified and 
analyzed on an Agflent Bioanalyzer DNA 1000 chip 
before equimolar pooling. The sample was loaded onto a 
single lane of an Illumina flow cell and sequenced to 
101x2 paired-end reads on an lUumina HiSeq. Base 
caUing was performed using Illumina Pipeline vl.5. 

Sequence reads collected from the Illumina were 
demultiplexed and mapped to the hgl9 assembly [11] 
using a custom mapping algorithm similar to the one 
used in previous publications [12,13]. To avoid the 
problem of reads from the amplified region erroneously 
mapping to the pseudogene, CYP21A1P, and/or homolo- 
gous surrounding areas, the region from chromosome 6 
between 31971000 and 31982000 was removed from the 
reference sequence. 
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Statistical analysis 

After mapping and alignment of linked reads covering 
two heterozygous positions a matrix was constructed to 
quantify the associations between every possible pair of 
base calls between the two positions. Confidence inter- 
vals for all base calls were calculated by bootstrapping 
based on the observed frequency of base calls in each 
association matrix. For each upstream base call (associ- 
ation matrix rows), a probability distribution was 
constructed for all possible downstream base calls (asso- 
ciation matrix columns). Observed counts in each row 
were converted to probabilities and used for multi- 
nomial resampling with the total number of samples set 
to the sum of observations in the row. In addition to the 
observed probabilities, 1% was distributed across each 
row to simulate random error associated with NGS 
sequencing. Following every cycle of sampling, the 
counts for each base call were converted to probabilities 
and used to construct a set of distributions. After 1000 
sampling iterations, confidence intervals were set for 
each possible downstream base call by ranking the 
resulting probabilities for that base and selecting the 1% 
and 99% values from the distribution. 

For haplotyping regions longer than the span of PE or 
MP fragments several association matrices can be 
chained together. In this case, bootstrapping for each 
individual matrix was performed as described above. 
The Unkage between pairs of heterozygous positions 
followed a Markov Chain model in that the probability 
of association between two base calls was unrelated to 
previous base calls in the chain. To begin the chain, 
association matrix Aj was constructed between two 
heterozygous positions, ho and hj. One of the two bases 
was arbitrarily chosen from ho, and probabilities and 
confidence intervals for each base at hj were calculated 
as described above. Next, association matrix A2 was 
constructed between positions hj and /z2. The base with 
highest probability at hj from Aj was selected and 
probabilities for association with this base at in 
were calculated. By iteration of this cycle, a chain of 
associated base calls can easily be made for one allele. 
To validate the results from one allele, the opposite 
allele can be phased by selecting the alternate base at ho 
and crosschecking the two resulting chains. 

To quantif)^ the confidence of association between two 
distant heterozygote calls, a cumulative probability was 
calculated as the product of all prior probabilities in the 
associated chain. Cumulative confidence intervals were 
also calculated from a distribution made from the 
products of each previously occurring bootstrap result. 
Using these measures, the limits to the length of chained 
phasing become apparent when the confidence intervals 
of rejected base calls begin to overlap with the cumula- 
tive interval. 



Results 

To demonstrate that a combined PE and MP sequencing 
strategy could allow us to accurately phase compound 
heterozygous sequence variants over a significant genomic 
distance, we divided the problem into three components. 
First, we performed experiments to determine the ne- 
cessary conditions for adequate sequence coverage and 
showed proof of principle of accurate variant phasing, 
using CYP21A2 as a model system. Next, we demonstrated 
that the analysis can be extended to phase DNA fragments 
across distances that are much larger than those included 
in the MP library. Finally, we explored the principle sources 
of experimental error. 

Phasing a single pair of heterozygote sequence variants 

Confidence of NGS base calls is a function of coverage 
at a given position. Since our strategy requires accurate 
association of two heterozygous positions (a total of 4 
base calls), high coverage is required throughout the 
target region. To this end, we designed a long range- 
PCR (IrPCR) for enrichment by amplification of the active 
gene CYP21A2, while excluding its highly homologous 
pseudogene, CYP21A1P. 

While enrichment boosts coverage, it also increases the 
likelihood that two fragments of DNA from opposite 
CYP21A2 alleles will be ligated together during MP library 
construction. This event would generate false cis associa- 
tions between loci. We reasoned that we could reduce the 
probability of inter-allelic recombination by adding an 
excess of background genomic DNA to the gene specific 
IrPCR product, biasing any recombination towards non- 
target sequences. Libraries made with 10, 100, 500 and 
1000 ng of IrPCR product produced sequence coverages 
of l,600x, 10,900x, 60,400x and 130,500x, respectively. By 
contrast, coverage by MP fragments outside of the ampli- 
fied target region averaged slightly less than 2x. 

Linked reads are of even greater importance for phasing 
than raw coverage is for accuracy. Any linked read method 
for phasing needs to generate an extended distribution of 
fragment sizes. This assures enough depth of coverage 
between any two points within a gene to accommodate a 
broad range of potential distances between heterozygous 
positions. To verify that we had achieved this goal we calcu- 
lated the linked coverage in our NGS data as a function of 
distance A between base positions. For every position x in 
the amplicon, we counted the number of linked reads 
covering both x and :v + A, for A5 from 101 to 3000 bp, and 
then calculated and plotted the average linked coverage. 
Paired end libraries provided linked reads up to 500 bp 
while MP libraries produced a population of fragments 
ranging from about 200 bp to over 3000 bp (Figure 3). 

Previous genotyping of the specimen tested here 
showed two heterozygous disease-causing mutations; 
however their phase was not clear from the Sanger 
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Figure 3 Average linked coverage by PE and MP reads from four libraries. Linked read coverage is sliown from PE and MP reads as a 
function of distance between linl<ed positions for eacli of tine enriclied libraries (10, 100, 500 and 1,000 ng spil<ed IrPCR amplicon in WGA DNA). 



sequences and required family studies. The first mutation, 
C.60G > A introduces a stop codon at amino acid position 
20. The second, IVS2-13A > G is a common spUce 
site mutation in intron 2. Both variants produce trun- 
cated proteins and are associated with the classical 
form of CAH. 

After mapping all of the reads in the library, fragments 
were selected that covered both heterozygous positions 
with their pairs of sequence reads. The base calls at each 
heterozygous position from each fragment were obser- 
ved and compared to establish the relationship between 
the two alleles. Using these base calls, an association 
matrix was constructed to measure the frequency of 
each association (Figure 4a). In each library, the wild- 
type G at position c.60 was most frequently associated 
with a mutant G in the IVS2-13 position. Conversely, 
the mutant A at position c.60 was most frequently associ- 
ated with the wild-type A at IVS2-13. This indicated that 
the two mutants were on opposite alleles, a result that was 
congruent with the conventional phased genotype that 
had previously been established through allelic segregation 
studies of the proband s family. 

The next step was to quantif)^ more precisely how 
confident one could be that the trans phasing result was 
correct. We used bootstrapping for this, calculating 99% 
confidence intervals around the probability of each pos- 
sible downstream base call. The width of the confidence 
intervals therefore, is related to the depth of linked 
coverage between the two mutant sites (Figure 4b). To 
clarify this relationship, we ran simulations for varying 
amounts of coverage using probabilities from a single 
dataset (500 ng amplicon spike) and calculated the aver- 
age width of all resulting confidence intervals. Both the 



simulation and observed confidence intervals indicate 
that coverage above 500 x provides diminishing returns 
in phasing confidence (Figure 4c). 

Extending the method over longer genetic distances 

In our test specimen, the two mutant positions were well 
covered by a subset of PE and MP fragments. However, 
it is likely that in some cases (or in different genes) het- 
erozygous mutations will be separated by more than 
2000 bases. For these situations, we have developed a 
computational method to chain together linked reads by 
constructing association matrices between pairs of sev- 
eral heterozygous sequence positions (normal sequence 
variants, VUSs. or mutations) in tandem through the 
length of the amplified region. Provided there are enough 
heterozygous positions in the specimen that fall within 
the limits of the combined MP and PE libraries, the 
entire amplified region can be phased using this iterative 
approach. Statistical analysis of the phase assignment 
across an entire chain of linked sequence variants is iden- 
tical to the single association matrix, except that a cumu- 
lative probability and confidence interval is calculated 
between the two mutant positions to measure confidence 
in the data used to link the two. Since this cumulative 
measure is the product of all upstream probabilities in the 
chain, its value will decline in proportion to the amount of 
error in each association matrix. This diminishes the 
probability of the final overall phase-call in relation to the 
first. However, as long as there is full separation of the 
confidence limits of the final cumulative phase determin- 
ation from all other possibilities, a confident call can be 
made. It is thus possible to extend the phasing chain 
across the entire 10 kb IrPCR amplicon without any 



Cradic et al. BMC Medical Genetics 2014, 15:19 
http://www.bionnedcentral.conn/1 471-2350/1 5/1 9 



Page 6 of 8 



CD I 



^ I 
CO I 



Associations between mutant 
positions for four libraries 



10 ng 

IVS2-13A/G 







c Q 






1 


0 


1 


0 


c 


0 


0 


0 


0 




0 


0 


2 


0 


T 


0 


0 


0 


0 




500 ng 

IVS2-13A/G 

O c B T 




88 


2 


5 


0 


C 


0 


0 


0 


0 




4 


1 


119 


0 


T 


2 


0 


0 


0 



^ I 

CD I 



^ I 
CD I 



100 ng 





IVS2-13A/G 
n C B T 




18 


1 


1 


0 


c 


0 


0 


0 


0 




2 


0 


12 


1 


T 


0 


0 


0 


0 



1000 ng 



Probabilities for 1VS2-13 base calls 
associated with 60A 



GO 

d 



~ o 
TO 

o 

CL o 





IVS2-13A/G 










B 




CM 




230 


3 


16 


0 


o 


C 


0 


0 


2 


0 






15 


3 


305 


0 


o 


T 


0 


0 


0 


0 


O 



T-i-C 



T4- 




4X 



35 X 



219 X 



r 



G i 



572 X 



Effect of Coverage on 
Confidence Interval Width 



0.35 1 



— Simulated 
• Observeci 




500 1000 1500 

Coverage 



2000 



Probabilities for 1VS2-13 base calls 
associated with 60G 



CO 

d 



~ o 
JQ 

O- d 



CM 

d 



o 
d 



^^c 



4X 



35 X 219 X 

Coverage 



C 



572 X 



Figure 4 Confidence in phasing calls is dependant on coverage, (a) Association matrices from eacli spil<ed library sliow tine relationsliip 
between tine two disease-causing lieterozygous positions in tliis specimen. Higliligliting of tine bases at eacli locus indicates wild-type (green) 
and mutant (red), (b) The probabilities and 99% confidence intervals for all possible IVS2-13 base calls associated with each C.60G/A allele are 
shown for all four amplicon spiked libraries. Coverage increases linearly as spike concentration increases, (c) Average confidence interval (CI) width 
was calculated to evaluate the level of coverage required for confident heterozygote association. A simulation run to test varying coverage and 
observed data points both indicate that coverage beyond 500x provides diminishing returns in CI width. 



Cradic et al. BMC Medical Genetics 2014, 15:19 
http://www.bionnedcentral.conn/1 471-2350/1 5/1 9 



Page 7 of 8 



overlap of confidence intervals, indicating accurate phas- 
ing throughout (Figure 5). 

Sources of error in the association matrix 

Each association matrix contains a small percentage of 
incorrectly linked bases. The sources of error in NGS 
datasets have been previously explored and attributed 
principally to detection error during data acquisition, 
fluorescence spectral overlap and computational mis- 
alignment of reads in highly homologous regions 
[14,15]. Since our protocol includes IrPCR enrichment 
followed by MP library preparation, we also had to con- 
sider the contribution from in vitro recombination 
events that occur during amplification or circularization. 

While a thorough investigation of this type of error is 
beyond the scope of this paper, we were able to quantify 
two types of inaccuracy by constructing association matri- 
ces between every pair of heterozygotes in the CYP21A2 
gene. Recombination events, i.e. MP reads that include a 
base from either allele, were the most common source of 
error. As a percentage of total reads, this type of error 
averaged about 7% and it proved to be constant across 
every combination of PGR product and background DNA 
mix (5.4%, 7.4%, 6.2% and 7.2% error for 10, 100, 500 and 
1000 ng of amplicon input, respectively). In addition, there 
was no change in these error rates as a function of linked 
read length or coverage. This indicates that our initial 
assumption was incorrect; ligation of fragments from 
opposite alleles during MP library preparation did not 
prove to be a major contributor to erroneous base or phas- 
ing calls. Furthermore, because the percentage of recom- 
bination does not change with proportion to the amount of 
amplicon spiked into each library, these events must occur 
prior to library creation, i.e. during IrPCR. These 



observations testify to the reliability of the MP library 
protocol and highlight the importance of high fidelity in 
PGR reactions. 

Finally, some incorrect base and/or phasing call errors 
could not be attributed to recombination artefacts. 
Across all heterozygous pairs analyzed in our data, only 
2% of the total reads fell into this category, a value that 
accords with other reported values for random error in 
NGS data [16,17]. 

Discussion 

Using our MP library approach and subsequent computa- 
tional analysis we have been able to successfully haplotype 
a specific region of interest in an individual who had two 
heterozygous disease-causing mutations. This method is 
an improvement over other available phasing protocols 
because of its simplicity and because of the statistical 
measure of assurance it provides. In regions where cover- 
age is low or where recombination is present in the frag- 
ment library, erroneous phasing calls can easily be made 
by other methods. In addition to these advantages, our 
method provides the ability to phase heterozygous posi- 
tions that are thousands of bases apart. 

Performed as a single protocol, this method is capable 
of acquiring a completely phased genotype for an entire 
10 kb IrPGR amplicon. Target regions of this size can be 
routinely amplified, and 20-30 kb amplicons are achiev- 
able in many instances. In principle, this method could 
also work beyond 10-30 kb, if several overlapping IrPGR 
amplicons are used as starting material, and as long as 
sufficient overlapping MP fragments can be generated 
that share heterozygous positions. An average MP library 
size which exceeds the 2 kb observed in our study would 
be expected to improve the likelihood of finding an 
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unbroken linked chain of polymorphisms, while simultan- 
eously reducing the number association matrices needed 
for complete phasing of a region of interest, thereby 
improving the confidence in the accuracy of the overall 
haplotype. Since the Illumina MP protocol is optimized 
for initial fragmentation libraries of 2-5 kb, such improve- 
ments should be relatively easy to achieve. 

In theory, there is no upper limit to the scalability of our 
approach and it could even be applied to whole genome se- 
quencing, provided sequence coverage and linked coverage 
are high enough. Without regard to logistic or cost consid- 
erations, we speculate that this technique might actually be 
very successful in this setting, because the error attributable 
to inter-allelic MP ligation proved to be very low. Neverthe- 
less, it is likely that one would have to break down the ana- 
lysis of an entire genome into smaller haplotype units, in 
order to maintain high confidence of the phase calls. We 
would anticipate that the size of these units would be 
similar to what can be achieved by optimal combina- 
tions of IrPCR and MP protocols, as described above. 

Two limitations that we foresee for accurate phasing are 
highly homologous genes and gene duplications or other 
copy number changes. In either of these cases, we would 
anticipate phasing errors to increase due to mis -assignment 
of reads. In addition, increases in gene copy number would 
exponentially increase the number of possible phase com- 
binations for any given combination of polymorphic 
positions, increasing computational requirements and 
decreasing ultimate haplotyping accuracy, and in some 
cases, phase assignments might be impossible. 

Conclusions 

In summary, compared with previous approaches, our 
MP NGS sequencing technique is a simple solution to 
the problem of accurately phased genotyping for many 
recessive diseases, and perhaps, many other genetic 
phasing problems. The method could be adapted to 
other NGS platforms since they are all based on deriving 
sequences by aligning large numbers of overlapping 
reads. As clinical molecular diagnosis rapidly approaches 
massively parallel sequencing as the preferred assay 
method, it could serve as a cost-effective way to obtain a 
completely resolved set of haplotypes for single genes, 
panels of related genes, or even significant portions of 
chromosomes. 
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